This report explores a dataset containing details for approximately 114,000 loans. The dataset was provided by Prosper.
## [1] 113937 81
## 'data.frame': 113937 obs. of 81 variables:
## $ ListingKey : chr "1021339766868145413AB3B" "10273602499503308B223C1" "0EE9337825851032864889A" "0EF5356002482715299901A" ...
## $ ListingNumber : int 193129 1209647 81716 658116 909464 1074836 750899 768193 1023355 1023355 ...
## $ ListingCreationDate : chr "2007-08-26 19:09:29.263000000" "2014-02-27 08:28:07.900000000" "2007-01-05 15:00:47.090000000" "2012-10-22 11:02:35.010000000" ...
## $ CreditGrade : chr "C" "" "HR" "" ...
## $ Term : int 36 36 36 36 36 60 36 36 36 36 ...
## $ LoanStatus : chr "Completed" "Current" "Completed" "Current" ...
## $ ClosedDate : chr "2009-08-14 00:00:00" "" "2009-12-17 00:00:00" "" ...
## $ BorrowerAPR : num 0.165 0.12 0.283 0.125 0.246 ...
## $ BorrowerRate : num 0.158 0.092 0.275 0.0974 0.2085 ...
## $ LenderYield : num 0.138 0.082 0.24 0.0874 0.1985 ...
## $ EstimatedEffectiveYield : num NA 0.0796 NA 0.0849 0.1832 ...
## $ EstimatedLoss : num NA 0.0249 NA 0.0249 0.0925 ...
## $ EstimatedReturn : num NA 0.0547 NA 0.06 0.0907 ...
## $ ProsperRating..numeric. : int NA 6 NA 6 3 5 2 4 7 7 ...
## $ ProsperRating..Alpha. : chr "" "A" "" "A" ...
## $ ProsperScore : num NA 7 NA 9 4 10 2 4 9 11 ...
## $ ListingCategory..numeric. : int 0 2 0 16 2 1 1 2 7 7 ...
## $ BorrowerState : chr "CO" "CO" "GA" "GA" ...
## $ Occupation : chr "Other" "Professional" "Other" "Skilled Labor" ...
## $ EmploymentStatus : chr "Self-employed" "Employed" "Not available" "Employed" ...
## $ EmploymentStatusDuration : int 2 44 NA 113 44 82 172 103 269 269 ...
## $ IsBorrowerHomeowner : chr "True" "False" "False" "True" ...
## $ CurrentlyInGroup : chr "True" "False" "True" "False" ...
## $ GroupKey : chr "" "" "783C3371218786870A73D20" "" ...
## $ DateCreditPulled : chr "2007-08-26 18:41:46.780000000" "2014-02-27 08:28:14" "2007-01-02 14:09:10.060000000" "2012-10-22 11:02:32" ...
## $ CreditScoreRangeLower : int 640 680 480 800 680 740 680 700 820 820 ...
## $ CreditScoreRangeUpper : int 659 699 499 819 699 759 699 719 839 839 ...
## $ FirstRecordedCreditLine : chr "2001-10-11 00:00:00" "1996-03-18 00:00:00" "2002-07-27 00:00:00" "1983-02-28 00:00:00" ...
## $ CurrentCreditLines : int 5 14 NA 5 19 21 10 6 17 17 ...
## $ OpenCreditLines : int 4 14 NA 5 19 17 7 6 16 16 ...
## $ TotalCreditLinespast7years : int 12 29 3 29 49 49 20 10 32 32 ...
## $ OpenRevolvingAccounts : int 1 13 0 7 6 13 6 5 12 12 ...
## $ OpenRevolvingMonthlyPayment : num 24 389 0 115 220 1410 214 101 219 219 ...
## $ InquiriesLast6Months : int 3 3 0 0 1 0 0 3 1 1 ...
## $ TotalInquiries : num 3 5 1 1 9 2 0 16 6 6 ...
## $ CurrentDelinquencies : int 2 0 1 4 0 0 0 0 0 0 ...
## $ AmountDelinquent : num 472 0 NA 10056 0 ...
## $ DelinquenciesLast7Years : int 4 0 0 14 0 0 0 0 0 0 ...
## $ PublicRecordsLast10Years : int 0 1 0 0 0 0 0 1 0 0 ...
## $ PublicRecordsLast12Months : int 0 0 NA 0 0 0 0 0 0 0 ...
## $ RevolvingCreditBalance : num 0 3989 NA 1444 6193 ...
## $ BankcardUtilization : num 0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
## $ AvailableBankcardCredit : num 1500 10266 NA 30754 695 ...
## $ TotalTrades : num 11 29 NA 26 39 47 16 10 29 29 ...
## $ TradesNeverDelinquent..percentage. : num 0.81 1 NA 0.76 0.95 1 0.68 0.8 1 1 ...
## $ TradesOpenedLast6Months : num 0 2 NA 0 2 0 0 0 1 1 ...
## $ DebtToIncomeRatio : num 0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
## $ IncomeRange : chr "$25,000-49,999" "$50,000-74,999" "Not displayed" "$25,000-49,999" ...
## $ IncomeVerifiable : chr "True" "True" "True" "True" ...
## $ StatedMonthlyIncome : num 3083 6125 2083 2875 9583 ...
## $ LoanKey : chr "E33A3400205839220442E84" "9E3B37071505919926B1D82" "6954337960046817851BCB2" "A0393664465886295619C51" ...
## $ TotalProsperLoans : int NA NA NA NA 1 NA NA NA NA NA ...
## $ TotalProsperPaymentsBilled : int NA NA NA NA 11 NA NA NA NA NA ...
## $ OnTimeProsperPayments : int NA NA NA NA 11 NA NA NA NA NA ...
## $ ProsperPaymentsLessThanOneMonthLate: int NA NA NA NA 0 NA NA NA NA NA ...
## $ ProsperPaymentsOneMonthPlusLate : int NA NA NA NA 0 NA NA NA NA NA ...
## $ ProsperPrincipalBorrowed : num NA NA NA NA 11000 NA NA NA NA NA ...
## $ ProsperPrincipalOutstanding : num NA NA NA NA 9948 ...
## $ ScorexChangeAtTimeOfListing : int NA NA NA NA NA NA NA NA NA NA ...
## $ LoanCurrentDaysDelinquent : int 0 0 0 0 0 0 0 0 0 0 ...
## $ LoanFirstDefaultedCycleNumber : int NA NA NA NA NA NA NA NA NA NA ...
## $ LoanMonthsSinceOrigination : int 78 0 86 16 6 3 11 10 3 3 ...
## $ LoanNumber : int 19141 134815 6466 77296 102670 123257 88353 90051 121268 121268 ...
## $ LoanOriginalAmount : int 9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
## $ LoanOriginationDate : chr "2007-09-12 00:00:00" "2014-03-03 00:00:00" "2007-01-17 00:00:00" "2012-11-01 00:00:00" ...
## $ LoanOriginationQuarter : chr "Q3 2007" "Q1 2014" "Q1 2007" "Q4 2012" ...
## $ MemberKey : chr "1F3E3376408759268057EDA" "1D13370546739025387B2F4" "5F7033715035555618FA612" "9ADE356069835475068C6D2" ...
## $ MonthlyLoanPayment : num 330 319 123 321 564 ...
## $ LP_CustomerPayments : num 11396 0 4187 5143 2820 ...
## $ LP_CustomerPrincipalPayments : num 9425 0 3001 4091 1563 ...
## $ LP_InterestandFees : num 1971 0 1186 1052 1257 ...
## $ LP_ServiceFees : num -133.2 0 -24.2 -108 -60.3 ...
## $ LP_CollectionFees : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_GrossPrincipalLoss : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_NetPrincipalLoss : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_NonPrincipalRecoverypayments : num 0 0 0 0 0 0 0 0 0 0 ...
## $ PercentFunded : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Recommendations : int 0 0 0 0 0 0 0 0 0 0 ...
## $ InvestmentFromFriendsCount : int 0 0 0 0 0 0 0 0 0 0 ...
## $ InvestmentFromFriendsAmount : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Investors : int 258 1 41 158 20 1 1 1 1 1 ...
## ListingKey ListingNumber ListingCreationDate CreditGrade
## Length:113937 Min. : 4 Length:113937 Length:113937
## Class :character 1st Qu.: 400919 Class :character Class :character
## Mode :character Median : 600554 Mode :character Mode :character
## Mean : 627886
## 3rd Qu.: 892634
## Max. :1255725
##
## Term LoanStatus ClosedDate BorrowerAPR
## Min. :12.00 Length:113937 Length:113937 Min. :0.00653
## 1st Qu.:36.00 Class :character Class :character 1st Qu.:0.15629
## Median :36.00 Mode :character Mode :character Median :0.20976
## Mean :40.83 Mean :0.21883
## 3rd Qu.:36.00 3rd Qu.:0.28381
## Max. :60.00 Max. :0.51229
## NA's :25
## BorrowerRate LenderYield EstimatedEffectiveYield EstimatedLoss
## Min. :0.0000 Min. :-0.0100 Min. :-0.183 Min. :0.005
## 1st Qu.:0.1340 1st Qu.: 0.1242 1st Qu.: 0.116 1st Qu.:0.042
## Median :0.1840 Median : 0.1730 Median : 0.162 Median :0.072
## Mean :0.1928 Mean : 0.1827 Mean : 0.169 Mean :0.080
## 3rd Qu.:0.2500 3rd Qu.: 0.2400 3rd Qu.: 0.224 3rd Qu.:0.112
## Max. :0.4975 Max. : 0.4925 Max. : 0.320 Max. :0.366
## NA's :29084 NA's :29084
## EstimatedReturn ProsperRating..numeric. ProsperRating..Alpha. ProsperScore
## Min. :-0.183 Min. :1.000 Length:113937 Min. : 1.00
## 1st Qu.: 0.074 1st Qu.:3.000 Class :character 1st Qu.: 4.00
## Median : 0.092 Median :4.000 Mode :character Median : 6.00
## Mean : 0.096 Mean :4.072 Mean : 5.95
## 3rd Qu.: 0.117 3rd Qu.:5.000 3rd Qu.: 8.00
## Max. : 0.284 Max. :7.000 Max. :11.00
## NA's :29084 NA's :29084 NA's :29084
## ListingCategory..numeric. BorrowerState Occupation
## Min. : 0.000 Length:113937 Length:113937
## 1st Qu.: 1.000 Class :character Class :character
## Median : 1.000 Mode :character Mode :character
## Mean : 2.774
## 3rd Qu.: 3.000
## Max. :20.000
##
## EmploymentStatus EmploymentStatusDuration IsBorrowerHomeowner
## Length:113937 Min. : 0.00 Length:113937
## Class :character 1st Qu.: 26.00 Class :character
## Mode :character Median : 67.00 Mode :character
## Mean : 96.07
## 3rd Qu.:137.00
## Max. :755.00
## NA's :7625
## CurrentlyInGroup GroupKey DateCreditPulled CreditScoreRangeLower
## Length:113937 Length:113937 Length:113937 Min. : 0.0
## Class :character Class :character Class :character 1st Qu.:660.0
## Mode :character Mode :character Mode :character Median :680.0
## Mean :685.6
## 3rd Qu.:720.0
## Max. :880.0
## NA's :591
## CreditScoreRangeUpper FirstRecordedCreditLine CurrentCreditLines
## Min. : 19.0 Length:113937 Min. : 0.00
## 1st Qu.:679.0 Class :character 1st Qu.: 7.00
## Median :699.0 Mode :character Median :10.00
## Mean :704.6 Mean :10.32
## 3rd Qu.:739.0 3rd Qu.:13.00
## Max. :899.0 Max. :59.00
## NA's :591 NA's :7604
## OpenCreditLines TotalCreditLinespast7years OpenRevolvingAccounts
## Min. : 0.00 Min. : 2.00 Min. : 0.00
## 1st Qu.: 6.00 1st Qu.: 17.00 1st Qu.: 4.00
## Median : 9.00 Median : 25.00 Median : 6.00
## Mean : 9.26 Mean : 26.75 Mean : 6.97
## 3rd Qu.:12.00 3rd Qu.: 35.00 3rd Qu.: 9.00
## Max. :54.00 Max. :136.00 Max. :51.00
## NA's :7604 NA's :697
## OpenRevolvingMonthlyPayment InquiriesLast6Months TotalInquiries
## Min. : 0.0 Min. : 0.000 Min. : 0.000
## 1st Qu.: 114.0 1st Qu.: 0.000 1st Qu.: 2.000
## Median : 271.0 Median : 1.000 Median : 4.000
## Mean : 398.3 Mean : 1.435 Mean : 5.584
## 3rd Qu.: 525.0 3rd Qu.: 2.000 3rd Qu.: 7.000
## Max. :14985.0 Max. :105.000 Max. :379.000
## NA's :697 NA's :1159
## CurrentDelinquencies AmountDelinquent DelinquenciesLast7Years
## Min. : 0.0000 Min. : 0.0 Min. : 0.000
## 1st Qu.: 0.0000 1st Qu.: 0.0 1st Qu.: 0.000
## Median : 0.0000 Median : 0.0 Median : 0.000
## Mean : 0.5921 Mean : 984.5 Mean : 4.155
## 3rd Qu.: 0.0000 3rd Qu.: 0.0 3rd Qu.: 3.000
## Max. :83.0000 Max. :463881.0 Max. :99.000
## NA's :697 NA's :7622 NA's :990
## PublicRecordsLast10Years PublicRecordsLast12Months RevolvingCreditBalance
## Min. : 0.0000 Min. : 0.000 Min. : 0
## 1st Qu.: 0.0000 1st Qu.: 0.000 1st Qu.: 3121
## Median : 0.0000 Median : 0.000 Median : 8549
## Mean : 0.3126 Mean : 0.015 Mean : 17599
## 3rd Qu.: 0.0000 3rd Qu.: 0.000 3rd Qu.: 19521
## Max. :38.0000 Max. :20.000 Max. :1435667
## NA's :697 NA's :7604 NA's :7604
## BankcardUtilization AvailableBankcardCredit TotalTrades
## Min. :0.000 Min. : 0 Min. : 0.00
## 1st Qu.:0.310 1st Qu.: 880 1st Qu.: 15.00
## Median :0.600 Median : 4100 Median : 22.00
## Mean :0.561 Mean : 11210 Mean : 23.23
## 3rd Qu.:0.840 3rd Qu.: 13180 3rd Qu.: 30.00
## Max. :5.950 Max. :646285 Max. :126.00
## NA's :7604 NA's :7544 NA's :7544
## TradesNeverDelinquent..percentage. TradesOpenedLast6Months DebtToIncomeRatio
## Min. :0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.:0.820 1st Qu.: 0.000 1st Qu.: 0.140
## Median :0.940 Median : 0.000 Median : 0.220
## Mean :0.886 Mean : 0.802 Mean : 0.276
## 3rd Qu.:1.000 3rd Qu.: 1.000 3rd Qu.: 0.320
## Max. :1.000 Max. :20.000 Max. :10.010
## NA's :7544 NA's :7544 NA's :8554
## IncomeRange IncomeVerifiable StatedMonthlyIncome LoanKey
## Length:113937 Length:113937 Min. : 0 Length:113937
## Class :character Class :character 1st Qu.: 3200 Class :character
## Mode :character Mode :character Median : 4667 Mode :character
## Mean : 5608
## 3rd Qu.: 6825
## Max. :1750003
##
## TotalProsperLoans TotalProsperPaymentsBilled OnTimeProsperPayments
## Min. :0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.:1.00 1st Qu.: 9.00 1st Qu.: 9.00
## Median :1.00 Median : 16.00 Median : 15.00
## Mean :1.42 Mean : 22.93 Mean : 22.27
## 3rd Qu.:2.00 3rd Qu.: 33.00 3rd Qu.: 32.00
## Max. :8.00 Max. :141.00 Max. :141.00
## NA's :91852 NA's :91852 NA's :91852
## ProsperPaymentsLessThanOneMonthLate ProsperPaymentsOneMonthPlusLate
## Min. : 0.00 Min. : 0.00
## 1st Qu.: 0.00 1st Qu.: 0.00
## Median : 0.00 Median : 0.00
## Mean : 0.61 Mean : 0.05
## 3rd Qu.: 0.00 3rd Qu.: 0.00
## Max. :42.00 Max. :21.00
## NA's :91852 NA's :91852
## ProsperPrincipalBorrowed ProsperPrincipalOutstanding
## Min. : 0 Min. : 0
## 1st Qu.: 3500 1st Qu.: 0
## Median : 6000 Median : 1627
## Mean : 8472 Mean : 2930
## 3rd Qu.:11000 3rd Qu.: 4127
## Max. :72499 Max. :23451
## NA's :91852 NA's :91852
## ScorexChangeAtTimeOfListing LoanCurrentDaysDelinquent
## Min. :-209.00 Min. : 0.0
## 1st Qu.: -35.00 1st Qu.: 0.0
## Median : -3.00 Median : 0.0
## Mean : -3.22 Mean : 152.8
## 3rd Qu.: 25.00 3rd Qu.: 0.0
## Max. : 286.00 Max. :2704.0
## NA's :95009
## LoanFirstDefaultedCycleNumber LoanMonthsSinceOrigination LoanNumber
## Min. : 0.00 Min. : 0.0 Min. : 1
## 1st Qu.: 9.00 1st Qu.: 6.0 1st Qu.: 37332
## Median :14.00 Median : 21.0 Median : 68599
## Mean :16.27 Mean : 31.9 Mean : 69444
## 3rd Qu.:22.00 3rd Qu.: 65.0 3rd Qu.:101901
## Max. :44.00 Max. :100.0 Max. :136486
## NA's :96985
## LoanOriginalAmount LoanOriginationDate LoanOriginationQuarter
## Min. : 1000 Length:113937 Length:113937
## 1st Qu.: 4000 Class :character Class :character
## Median : 6500 Mode :character Mode :character
## Mean : 8337
## 3rd Qu.:12000
## Max. :35000
##
## MemberKey MonthlyLoanPayment LP_CustomerPayments
## Length:113937 Min. : 0.0 Min. : -2.35
## Class :character 1st Qu.: 131.6 1st Qu.: 1005.76
## Mode :character Median : 217.7 Median : 2583.83
## Mean : 272.5 Mean : 4183.08
## 3rd Qu.: 371.6 3rd Qu.: 5548.40
## Max. :2251.5 Max. :40702.39
##
## LP_CustomerPrincipalPayments LP_InterestandFees LP_ServiceFees
## Min. : 0.0 Min. : -2.35 Min. :-664.87
## 1st Qu.: 500.9 1st Qu.: 274.87 1st Qu.: -73.18
## Median : 1587.5 Median : 700.84 Median : -34.44
## Mean : 3105.5 Mean : 1077.54 Mean : -54.73
## 3rd Qu.: 4000.0 3rd Qu.: 1458.54 3rd Qu.: -13.92
## Max. :35000.0 Max. :15617.03 Max. : 32.06
##
## LP_CollectionFees LP_GrossPrincipalLoss LP_NetPrincipalLoss
## Min. :-9274.75 Min. : -94.2 Min. : -954.5
## 1st Qu.: 0.00 1st Qu.: 0.0 1st Qu.: 0.0
## Median : 0.00 Median : 0.0 Median : 0.0
## Mean : -14.24 Mean : 700.4 Mean : 681.4
## 3rd Qu.: 0.00 3rd Qu.: 0.0 3rd Qu.: 0.0
## Max. : 0.00 Max. :25000.0 Max. :25000.0
##
## LP_NonPrincipalRecoverypayments PercentFunded Recommendations
## Min. : 0.00 Min. :0.7000 Min. : 0.00000
## 1st Qu.: 0.00 1st Qu.:1.0000 1st Qu.: 0.00000
## Median : 0.00 Median :1.0000 Median : 0.00000
## Mean : 25.14 Mean :0.9986 Mean : 0.04803
## 3rd Qu.: 0.00 3rd Qu.:1.0000 3rd Qu.: 0.00000
## Max. :21117.90 Max. :1.0125 Max. :39.00000
##
## InvestmentFromFriendsCount InvestmentFromFriendsAmount Investors
## Min. : 0.00000 Min. : 0.00 Min. : 1.00
## 1st Qu.: 0.00000 1st Qu.: 0.00 1st Qu.: 2.00
## Median : 0.00000 Median : 0.00 Median : 44.00
## Mean : 0.02346 Mean : 16.55 Mean : 80.48
## 3rd Qu.: 0.00000 3rd Qu.: 0.00 3rd Qu.: 115.00
## Max. :33.00000 Max. :25000.00 Max. :1189.00
##
Our dataset consists of 81 variables and almost 114,000 observations.

# Univariate Plots Section
The loan amounts do not seem to follow any common distribution; the amount of money lent looks fairly irregular. Most loans are for a relatively small amount, but there are also peaks at around $10,000 and $15,000, and a gap between $25,000 and $30,000. I wonder how this plot will look when split by categorical variables such as employment status, whether the borrower is a homeowner, and the borrower's income range.
Based on these plots, most people who take loans are employed. Home ownership does not seem to matter, because the number of borrowers is about the same in both categories. The distribution of income levels, however, is interesting.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 131.6 217.7 272.5 371.6 2251.5
Since the monthly payment is $0 for some loans, I might omit these values in later analysis. The highest monthly payment is $2,251.50, which is quite high. I am also interested in whether monthly payments are related to the APR (annual percentage rate) of a given loan.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.653 15.629 20.976 21.883 28.381 51.229 25
I transformed the BorrowerAPR variable into a new BorrowerAPRinPercent variable, because I find percentages easier to read than decimals, and I hope you do too. There were also some NA values, which I replaced with the median. Most APR values fall between roughly 10% and 40%.
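The transformation described above could look roughly like the following. This is a minimal sketch, not the report's actual code: it assumes the data frame is called `loans` (the name that appears in the grouped summaries later in this report) and uses a small made-up sample instead of the real file.

```r
# Hypothetical five-row sample standing in for the Prosper data
loans <- data.frame(BorrowerAPR = c(0.165, 0.120, NA, 0.283, 0.246))

# Express the APR as a percentage
loans$BorrowerAPRinPercent <- loans$BorrowerAPR * 100

# Replace missing values with the median of the observed values
apr_median <- median(loans$BorrowerAPRinPercent, na.rm = TRUE)
loans$BorrowerAPRinPercent[is.na(loans$BorrowerAPRinPercent)] <- apr_median

summary(loans$BorrowerAPRinPercent)
```

Median imputation keeps the centre of the distribution unchanged, but it slightly understates the spread, which is acceptable here because only 25 values are missing.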
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 6.00 9.00 9.26 12.00 54.00 7604
Based on this plot, most people have between 6 and 12 open credit lines, which seems like a lot to me.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 3200 4667 5608 6825 1750003
The first graph does not show much. However, after creating a summary and limiting the x axis to the 0.95 quantile, the graph gets clearer. It also suggests that the income ranges are plausible, because stated monthly income has a similar distribution.
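Limiting a plot to the 0.95 quantile can be sketched as follows. The incomes here are simulated, and the variable names `income` and `upper` are illustrative only; in the report the column would be `loans$StatedMonthlyIncome`. Note that this sketch drops the values above the limit before plotting, which is one of several ways to restrict the axis and may differ from what the report did.

```r
# Hypothetical skewed incomes plus one extreme value like the real maximum
set.seed(1)
income <- c(rlnorm(1000, meanlog = 8.4, sdlog = 0.5), 1750003)

# Upper limit for the x axis: the 95th percentile
upper <- quantile(income, probs = 0.95, names = FALSE)

# Histogram restricted to values at or below that limit
hist(income[income <= upper], breaks = 50,
     main = "Stated monthly income (up to the 95th percentile)",
     xlab = "Stated monthly income")
```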
To understand in which years people took the most loans, we can look at LoanOriginationYear, a new variable which I created from LoanOriginationDate.
It is also interesting to see whether there is a dependency between the month of the year and taking a loan.
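One way to derive the year and month variables from the date strings is sketched below. It assumes the timestamp format shown in the `str()` output above and uses four sample values; the report's own derivation is not shown and may differ.

```r
# Sample of the LoanOriginationDate strings shown in the str() output
loans <- data.frame(
  LoanOriginationDate = c("2007-09-12 00:00:00", "2014-03-03 00:00:00",
                          "2007-01-17 00:00:00", "2012-11-01 00:00:00"),
  stringsAsFactors = FALSE
)

# Parse the date part of the timestamp (first 10 characters, YYYY-MM-DD)
origination <- as.Date(substr(loans$LoanOriginationDate, 1, 10))

# Extract year and month as integers
loans$LoanOriginationYear  <- as.integer(format(origination, "%Y"))
loans$LoanOriginationMonth <- as.integer(format(origination, "%m"))

# Number of loans per year in the sample
table(loans$LoanOriginationYear)
```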
The first graph of DebtToIncomeRatio does not show much, because there are some extreme values. In the second graph, after changing the x scale, the picture gets clearer. We can also examine DebtToIncomeRatio to see whether there is a trend between this value and loan taking.
Last but not least is the loan status. Most of the loans in this dataset are in the Current state, which means that they are active, and the second biggest group is Completed.
There are 113,937 loans in the dataset with 84 features (originally there were 81, but I added three for a better understanding of the dataset). In my opinion the most interesting of them are: LoanOriginalAmount, EmploymentStatus, IsBorrowerHomeowner, IncomeRange, MonthlyLoanPayment, BorrowerAPRinPercent, OpenCreditLines, StatedMonthlyIncome, LoanOriginationYear, LoanOriginationMonth, DebtToIncomeRatio, LoanStatus. The loan amounts do not follow any clear distribution, but BorrowerAPR shows a slight pattern that resembles a normal distribution. Other observations:
The main features are LoanOriginalAmount, BorrowerAPR and Term. Based on these we can calculate how much a person will pay in total and, if the person does not pay on time, how much more the loan will cost because of the delay.
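The total-cost calculation mentioned above can be sketched with the standard fixed-payment (annuity) formula. Several things here are assumptions rather than facts from the dataset documentation: the helper `monthly_payment` is hypothetical, interest is assumed to compound monthly, and the nominal BorrowerRate is used instead of BorrowerAPR, since an APR also includes fees. Late fees and delays are not modelled.

```r
# Fixed monthly payment for a fully amortizing loan (hypothetical helper)
# principal: amount borrowed, annual_rate: nominal yearly rate (e.g. 0.158),
# term_months: number of monthly payments
monthly_payment <- function(principal, annual_rate, term_months) {
  r <- annual_rate / 12
  ifelse(r == 0,
         principal / term_months,
         principal * r / (1 - (1 + r)^(-term_months)))
}

# First loan in the str() output: 9425 borrowed at a 0.158 rate over 36 months
payment    <- monthly_payment(9425, 0.158, 36)
total_paid <- payment * 36
interest   <- total_paid - 9425

round(c(payment = payment, total_paid = total_paid, interest = interest), 2)
```

For this loan the computed payment lands close to the MonthlyLoanPayment of 330 listed for the same row, which suggests, but does not prove, that Prosper's payments follow this formula.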
To support the investigation, features such as EmploymentStatus, IncomeRange and MonthlyLoanPayment might be useful. We could also look at loans which were not paid and search for dependencies that predict whether a loan requester will be able to repay or not.
Yes, I created three new variables. I transformed BorrowerAPR into BorrowerAPRinPercent to make the numbers easier to read in graphs (for me, and maybe for others, percentages are more readable than decimals). I also split LoanOriginationDate into months and years to look for trends over time.
I did not see any unusual distributions. I had hoped to spot more trends in this data, but the loans seem to differ a lot from person to person. In the first part of this document I had to rescale axes a lot to make the graphs readable, and I had to limit some observations to exclude extreme values so that they did not obscure the rest of the data. I am not saying that these extreme values are wrong; they may be valid observations, but I wanted to look at the majority of the dataset. They might be valuable in the following parts of the research.
## OpenCreditLines DebtToIncomeRatio StatedMonthlyIncome
## OpenCreditLines 1 NA NA
## DebtToIncomeRatio NA 1 NA
## StatedMonthlyIncome NA NA 1.00
## LoanOriginalAmount NA NA 0.35
## MonthlyLoanPayment NA NA 0.34
## BorrowerAPRinPercent NA NA NA
## LoanOriginationYear NA NA 0.14
## LoanOriginationMonth NA NA 0.00
## LoanOriginalAmount MonthlyLoanPayment BorrowerAPRinPercent
## OpenCreditLines NA NA NA
## DebtToIncomeRatio NA NA NA
## StatedMonthlyIncome 0.35 0.34 NA
## LoanOriginalAmount 1.00 0.94 NA
## MonthlyLoanPayment 0.94 1.00 NA
## BorrowerAPRinPercent NA NA 1
## LoanOriginationYear 0.31 0.26 NA
## LoanOriginationMonth -0.02 -0.01 NA
## LoanOriginationYear LoanOriginationMonth
## OpenCreditLines NA NA
## DebtToIncomeRatio NA NA
## StatedMonthlyIncome 0.14 0.00
## LoanOriginalAmount 0.31 -0.02
## MonthlyLoanPayment 0.26 -0.01
## BorrowerAPRinPercent NA NA
## LoanOriginationYear 1.00 -0.10
## LoanOriginationMonth -0.10 1.00
We can see some basic correlations in the summary above, but let's examine the data in more detail.
MonthlyLoanPayment is strongly correlated with LoanOriginalAmount (0.94), and it is also weakly correlated with LoanOriginationYear (0.26), which is interesting. StatedMonthlyIncome is moderately correlated with LoanOriginalAmount (0.35).
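The NA entries in the correlation matrix are consistent with how `cor()` behaves by default when a column contains missing values, although the exact call used for the matrix is not shown in this report. The sketch below, on a small made-up sample, shows that default behaviour next to pairwise deletion via `use = "pairwise.complete.obs"`.

```r
# Hypothetical sample; the NA mimics the gaps in the real OpenCreditLines column
loans <- data.frame(
  OpenCreditLines    = c(4, 14, NA, 5, 19, 17, 7, 6),
  LoanOriginalAmount = c(9425, 10000, 3001, 10000, 15000, 15000, 3000, 10000),
  MonthlyLoanPayment = c(330, 319, 123, 321, 564, 342, 122, 373)
)

# Default behaviour: a pair of columns yields NA if either column contains NA
round(cor(loans), 2)

# Pairwise deletion: each coefficient uses the rows complete for that pair
round(cor(loans, use = "pairwise.complete.obs"), 2)
```

With pairwise deletion each coefficient may be based on a different subset of rows, so the values are best compared with some caution.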
There are very tall lines at some LoanOriginalAmount values, notably 15,000 and 25,000. It might be interesting to look at these in detail later in the research. Otherwise I cannot see any pattern in this data; the only thing I spotted is that people need an income of at least 7,500 to get a loan greater than 25,000.
These graphs show an interesting trend in loans and total payments. In the first graph the relationship looks roughly linear, and in the second graph, with the axes limited, it becomes more clearly linear, which gives us interesting insight into the data. The third graph again shows a linear relationship. I think it is quite interesting that the relationship appears to change just by limiting the axes.
I also wanted to examine whether there is a dependency between AvailableBankcardCredit and BankcardUtilization. However, the graph does not allow any conclusion.
Now let’s move on to some categorical variables.
## loans$IncomeRange: $0
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 2500 5000 7411 10000 25000
## ------------------------------------------------------------
## loans$IncomeRange: $1-24,999
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 2052 4000 4274 5000 25000
## ------------------------------------------------------------
## loans$IncomeRange: $25,000-49,999
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 3000 5000 6178 9800 25000
## ------------------------------------------------------------
## loans$IncomeRange: $50,000-74,999
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 4000 7500 8675 13500 25000
## ------------------------------------------------------------
## loans$IncomeRange: $75,000-99,999
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 4000 9700 10366 15000 25000
## ------------------------------------------------------------
## loans$IncomeRange: $100,000+
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 6000 12000 13073 18500 35000
## ------------------------------------------------------------
## loans$IncomeRange: Not displayed
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 2100 3033 5170 6001 25000
## ------------------------------------------------------------
## loans$IncomeRange: Not employed
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 2500 4000 4885 6000 25000
There seems to be a trend that people in higher income ranges tend to borrow more money. The one exception is people in the $0 income range.
Again, this plot does not allow any conclusion about a trend; I think the data only show that people in all groups tend to take loans. The only thing we can see is that employed people take more loans.
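The per-bracket comparison above depends on the income ranges being in their natural order rather than alphabetical order. A minimal sketch of that step, on made-up rows and with the hypothetical names `income_levels` and `medians`, could look like this; the full summaries shown above have the shape produced by `by(loans$LoanOriginalAmount, loans$IncomeRange, summary)`.

```r
# Hypothetical sample; the real data uses the same IncomeRange labels
loans <- data.frame(
  IncomeRange = c("$100,000+", "$25,000-49,999", "$0", "$50,000-74,999",
                  "$25,000-49,999", "$100,000+", "$1-24,999", "$75,000-99,999"),
  LoanOriginalAmount = c(12000, 5000, 5000, 7500, 3000, 18500, 4000, 9700),
  stringsAsFactors = FALSE
)

# Order the brackets as in the summaries above instead of alphabetically
income_levels <- c("$0", "$1-24,999", "$25,000-49,999", "$50,000-74,999",
                   "$75,000-99,999", "$100,000+", "Not displayed", "Not employed")
loans$IncomeRange <- factor(loans$IncomeRange, levels = income_levels,
                            ordered = TRUE)

# Median loan amount per bracket (brackets without rows return NA)
medians <- tapply(loans$LoanOriginalAmount, loans$IncomeRange, median)
medians
```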
## loans$EmploymentStatus:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 2000 3000 4563 5000 25000
## ------------------------------------------------------------
## loans$EmploymentStatus: Employed
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 4000 9000 9794 15000 35000
## ------------------------------------------------------------
## loans$EmploymentStatus: Full-time
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 2500 4950 6195 8000 35000
## ------------------------------------------------------------
## loans$EmploymentStatus: Not available
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 2138 3225 5373 6300 25000
## ------------------------------------------------------------
## loans$EmploymentStatus: Not employed
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 2500 4000 4873 6000 25000
## ------------------------------------------------------------
## loans$EmploymentStatus: Other
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 4000 4000 6862 10000 35000
## ------------------------------------------------------------
## loans$EmploymentStatus: Part-time
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 1600 3000 4089 5000 25000
## ------------------------------------------------------------
## loans$EmploymentStatus: Retired
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 2000 3500 4784 6000 25000
## ------------------------------------------------------------
## loans$EmploymentStatus: Self-employed
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 4000 7000 8123 11000 25000
It is also interesting to see from the statistics that people who are not employed have the same median LoanOriginalAmount (4,000) as people with the Other employment status. However, the third quartiles (6,000 versus 10,000) show a big difference between these two groups.
## loans$LoanStatus: Cancelled
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 1000 1000 1700 2500 3000
## ------------------------------------------------------------
## loans$LoanStatus: Chargedoff
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 3000 4500 6399 8000 25000
## ------------------------------------------------------------
## loans$LoanStatus: Completed
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 2550 4500 6189 8000 35000
## ------------------------------------------------------------
## loans$LoanStatus: Current
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2000 4000 10000 10361 15000 35000
## ------------------------------------------------------------
## loans$LoanStatus: Defaulted
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 2550 4275 6487 8000 25000
## ------------------------------------------------------------
## loans$LoanStatus: FinalPaymentInProgress
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2000 4000 6500 8346 10000 31000
## ------------------------------------------------------------
## loans$LoanStatus: Past Due (1-15 days)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2000 4000 7000 8468 12000 35000
## ------------------------------------------------------------
## loans$LoanStatus: Past Due (16-30 days)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2000 4000 6000 8156 11129 25000
## ------------------------------------------------------------
## loans$LoanStatus: Past Due (31-60 days)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2000 4000 6500 8534 10000 35000
## ------------------------------------------------------------
## loans$LoanStatus: Past Due (61-90 days)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2000 4000 6000 7730 10000 25000
## ------------------------------------------------------------
## loans$LoanStatus: Past Due (91-120 days)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1500 4000 6000 8004 11000 25000
## ------------------------------------------------------------
## loans$LoanStatus: Past Due (>120 days)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2500 4000 7500 8281 11250 15500
From the graph and data above we can see that most loans are still in the Current state and that their amounts have the widest spread on the x scale. We can also see that the cancelled loans were small, which I think is a good thing for the organization.
This graph shows something quite interesting: people take smaller loans in the middle of the year, and the loan amount grows towards the end of the year.
Based on the graph by year, I suppose that there was a trend before 2008 of people taking loans because they had to or because they wanted to buy property. After 2008, however, loan amounts fell back to the level of 2006. It took the company's customers three years to recover: in 2011 LoanOriginalAmount finally reached the same (slightly higher) level as in 2008. From then on, loan amounts only rose, which I suppose might be due to inflation or to other factors on the financial market.
I found a relationship between LoanOriginalAmount and LP_CustomerPayments, and another between IncomeRange and LoanOriginalAmount. The most surprising finding was a dependency I did not expect: the one between the month and LoanOriginalAmount. The other graphs had made me think that every loan is unique and that few of the provided variables depend on each other, so I was quite surprised to see that loan amounts in the middle of the year are lower than at the start and at the end of the year.
I found an interesting relationship between the years and months in which loans were taken. This was quite surprising, because there are not many dependencies among the other variables.
The strongest relationship I found was between LP_CustomerPayments and LoanOriginalAmount. This is logical, because the larger the loan, the more a person has to pay to complete it. However, the conclusion is not as simple as it may seem. Loans can have a wide variety of terms, yet despite this we see an almost linear relationship between the loan amount and customer payments.
The plots above are pretty hard to read because of the huge amount of data and the extreme values in them. They confirm the previous exploration: only a limited number of variables are related to each other, and the extremes in individual loans obscure the rest. Let's look at the same graphs with points rather than lines.
Here the distributions are easier to see, partly because we have chosen two strongly dependent variables, LoanOriginalAmount and MonthlyLoanPayment.
In that particular graph we can see something interesting: people who are employed and earn $50,000 or more tend to take higher loans and have higher monthly payments. This shows in the upper edge of the points, where MonthlyLoanPayment gets steeper towards the higher income ranges.
In the facet wrap by LoanStatus we can see that completed loans had the highest MonthlyLoanPayment of all the panels.
If we look at other variables for more dependencies, we cannot spot many. It seems that the theory of each loan being very individual is true.
Coming back to the dependency on the time when the loan was taken, we can see something interesting in the graph in which colors are determined by the year of the loan. The majority of loans were taken in recent years, and their amounts are higher than in the past.
This strengthened my observation that the monthly payment increases with loan size. I also tried to relate BorrowerAPR to other variables, but I did not see any dependency.
I think it is interesting that the majority of loans with high amounts are still in the Current state.
We cannot see any particular distribution of loan amounts, but there are interesting peaks at round numbers. It appears that numerous loans have been taken out in round amounts such as $5,000, $10,000, $15,000, $20,000 and $25,000.
Cancelled loans have the lowest median amount, while Current loans have the highest. Defaulted and Completed loans have similar distributions, except that Completed loans have a higher maximum amount. In the graph we can also spot an interesting pattern in the past due loans.
The plot indicates a relationship between MonthlyLoanPayment and LoanOriginalAmount. The color of the dots also suggests that people in higher income ranges could afford to take bigger loans.
The loans data set contains information on almost 114,000 loans, across 81 variables, from around 2005 onwards. I started by understanding the basic variables in the data set and chose the ones most relevant for my EDA. Along the way I kept asking questions to dig deeper into the data. After examining the basic variables one by one, I combined them to see the more complex story that the data tell.
In this data I found a strong relationship between loan amount and monthly payment. This may seem obvious, but the term of a loan also changes the monthly payment, and despite this variable the dependency is almost linear. Another interesting finding is the relationship between the loan amount and the month in which the loan was taken. I also saw how the financial crisis of 2008 affected the loan market and analysed how the variables in this data set are related.
I have come to the conclusion that it is not easy to build a model that predicts loan approval or loan amount from the information provided about a person. This made me even more fascinated by the algorithms which lending companies use to assess their applicants. To me it seems that every loan is very individual and has to be handled with respect to all its circumstances. If I were to continue exploring this data set, I would analyse variables such as the credit card information, the various scores and so on. I also think that this data set contains a lot of technical data, and I would personally appreciate more data about the borrower, such as age and education. This might help to create a prediction model which would be useful to some extent.